The 2019–20 coronavirus pandemic is an ongoing global pandemic of coronavirus disease (COVID-19). On 11 March 2020, the World Health Organization declared the outbreak a pandemic. As of 4 May 2020, over 3,500,000 cases had been confirmed in more than 120 countries and territories, with major outbreaks in the United States and Europe.
Our project is a thorough analysis of the available data on COVID-19. Specifically, we organized our analysis around five questions of interest:
What are the infection trends of COVID-19 in the United States?
What might the real number of patients be?
What is the best model for predicting the number of confirmed cases?
What factors play crucial roles in determining whether a country is likely to have a massive outbreak?
What insights can we obtain from previous pandemics to better prepare global responses to COVID-19?
Our data come mainly from Kaggle and the WHO. The significance of such a project goes without saying, as millions of people's well-being is under threat. In terms of modeling, we use an SIR model to simulate the disease's development trend, an SVM model to predict total infection numbers, and plotly to build interactive infection maps. We also combined a logistic growth model with the above models, with multiple rounds of parameter tuning.
# data import and cleaning
import akshare as ak
import numpy as np
import pandas as pd
from pandas import DataFrame
import datetime as dt
import re
import random
import collections
from functools import reduce
# plot
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
# math and scipy
import math
import scipy.integrate
from scipy.optimize import curve_fit
# sklearn
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression, ElasticNet, Lasso, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from sklearn.preprocessing import PolynomialFeatures
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
# get data from github by akshare API
global_accumulate = ak.covid_19_csse_daily(date="2020-04-30")
global_accumulate.head()
# drop some unnecessary columns
global_accumulate = global_accumulate.drop(['FIPS', 'Admin2', 'Last_Update', 'Combined_Key', 'Active'] ,axis=1)
global_accumulate = global_accumulate.rename(columns={'Province_State': 'Provice/State',
'Country_Region': 'Country/Region','Long_': 'Long'})
global_accumulate
global_accumulate.isna().sum()
# if we do not know the latitude, longitude or state, fill them with 'unknown'
global_accumulate.fillna('unknown', inplace=True)
global_accumulate.isna().sum()
global_accumulate.duplicated().sum()
global_accumulate.drop_duplicates(inplace=True)
global_accumulate.duplicated().sum()
len(global_accumulate['Country/Region'].unique())
# group by country
global_accumulate_byC = global_accumulate.groupby('Country/Region').agg({'Confirmed': 'sum',
'Deaths': 'sum',
'Recovered': 'sum'}).sort_values('Confirmed', ascending=False)
global_accumulate_byC.head()
# Choropleth Map COVID-19 confirmed cases around the world
global_accumulate_map = global_accumulate_byC.reset_index()[['Country/Region', 'Confirmed']]
colors = ["#F9F9F5", "#FAFAE6", "#FCFCCB", "#FCFCAE", "#FCF1AE", "#FCEA7D", "#FCD97D",
"#FCCE7D", "#FCC07D", "#FEB562", "#F9A648", "#F98E48", "#FD8739", "#FE7519",
"#FE5E19", "#FA520A", "#FA2B0A", "#9B1803", "#861604", "#651104", "#570303",]
fig = go.Figure(data=go.Choropleth(
locationmode = "country names",
locations = global_accumulate_map['Country/Region'],
z = global_accumulate_map['Confirmed'],
text = global_accumulate_map['Confirmed'],
colorscale = colors,
autocolorscale=False,
reversescale=False,
colorbar_title = 'Reported Covid-19 Cases',
))
fig.update_layout(
title_text='COVID19 Confirmed Cases By Country',
geo=dict(
showcoastlines=True,
),
)
# get the top ten countries in list for further use
top_10 = global_accumulate_byC.index.tolist()[:10]
# Total confirmed cases in top ten countries
plt.figure(figsize=(15,5))
sns.set(font_scale=1.5)
sns.barplot(x=global_accumulate_byC.Confirmed[:10],y=global_accumulate_byC.index[:10])
plt.title('Total Confirmed Cases In Top Ten Countries', pad=20, fontsize=20);
# Total death cases in top ten countries
plt.figure(figsize=(15,5))
sns.barplot(x=global_accumulate_byC.sort_values('Deaths', ascending=False).Deaths[:10],y=global_accumulate_byC.sort_values('Deaths', ascending=False).index[:10])
plt.xlabel('Deaths')
plt.title('Total Deaths In Top Ten Countries', pad=20, fontsize=20);
# plot the death rate in top 10 countries
global_accumulate_byC['Death Rate'] = global_accumulate_byC.Deaths/global_accumulate_byC.Confirmed
plt.figure(figsize=(10,4.8))
ax = sns.barplot(x=global_accumulate_byC.sort_values('Death Rate', ascending=False).iloc[:10, 3], y=global_accumulate_byC.sort_values('Death Rate', ascending=False).index[:10])
ax.set_title('Fatality In Top Ten Countries', pad=20, fontsize=20)
# set individual bar labels
for i in ax.patches:
# get_width pulls left or right; get_y pushes up or down
ax.text(i.get_width()+0.001, i.get_y()+0.5, str(round(i.get_width()*100, 2))+'%', fontsize=12)
global_confirmed_daily = ak.covid_19_csse_global_confirmed()
global_confirmed_daily.head()
# we do not need the state or its location
global_confirmed_daily = global_confirmed_daily.drop(['Province/State', 'Lat', 'Long'], axis=1)
global_confirmed_daily.head()
global_confirmed_daily_melt = pd.melt(global_confirmed_daily, id_vars=['Country/Region'])
global_confirmed_daily_melt.rename(columns={'variable': 'Date', 'value': 'Confirmed'}, inplace=True)
global_confirmed_daily_melt['Date'] = pd.to_datetime(global_confirmed_daily_melt['Date'])
global_confirmed_daily_melt
# get the top 10 confirmed countries
daily_confirmed_top10 = global_confirmed_daily_melt[global_confirmed_daily_melt['Country/Region'].apply(lambda c: c in top_10)]
# group by country and date then get sum of confirmed values
daily_confirmed_top10 = daily_confirmed_top10.groupby(['Country/Region','Date'])['Confirmed'].sum().reset_index()
# get the line plot
plt.figure(figsize=(20,10))
axis = sns.lineplot(x='Date', y='Confirmed', hue='Country/Region', data=daily_confirmed_top10)
axis.xaxis.set_major_locator(plt.MaxNLocator(10))
axis.set_xlabel('')
axis.set_title('Daily Confirmed In Top 10 Countries ',fontdict={'fontsize': 25, 'fontweight': 700, 'color': 'maroon'} ,pad=20);
China is no longer among the top ten countries, but it was the first country to become aware of this pandemic, so we include it here for comparison.
# read new cases data
new_case_daily = pd.read_csv('new_cases.csv')
new_case_daily
# get top ten countries and convert the date column to datetime
top_10_iso = new_case_daily[new_case_daily['location'] != 'World'].groupby('iso_code')['total_cases'].max().sort_values(ascending=False)[:10]
new_case_top10 = new_case_daily[new_case_daily['iso_code'].apply(lambda c: c in top_10_iso)].copy()  # copy to avoid SettingWithCopyWarning
new_case_top10.date = pd.to_datetime(new_case_top10.date)
# plot the daily new cases based on three days rolling
fig = px.line(new_case_top10, x="date", y=new_case_top10.new_cases.rolling(3).mean(), color='location')
fig.update_layout(title='Daily New Confirmed In Top 10 Countries (rolling 3-day average)',
xaxis_title='',
yaxis_title='Daily New Cases')
# plot new confirmed cases vs total confirmed case in log scale
fig = px.line(new_case_top10, x="total_cases", y=new_case_top10.new_cases.rolling(3).mean(), color='location', log_x=True, log_y=True)
fig.update_layout(title='Trajectory of World COVID-19 Confirmed Cases (rolling 3-day average)',
xaxis_title='Total Confirmed Cases',
yaxis_title='New Confirmed Cases (3-day average)')
# get the US confirmed and deaths data with state code and FIPS
us_confirmed = pd.read_csv('us_confirmed.csv')
us_deaths = pd.read_csv('us_deaths.csv')
# merge the two dataframes into a single US dataframe
us_confirmed = us_confirmed.iloc[:,[0,1,2,3,-1]]
us_deaths = us_deaths.iloc[:,[1,-1]]
us = pd.merge(us_confirmed, us_deaths, left_index=True, right_index=True)
us.rename(columns={'County Name_x': 'County Name','5/1/20_x': 'Confirmed', '5/1/20_y': 'Deaths'}, inplace=True)
us.drop('County Name_y', axis=1, inplace=True)
us
# Group the data by State code
us_confirmed_deaths = us.groupby('State')[['Confirmed', 'Deaths']].sum().reset_index()
us_confirmed_deaths['Text'] = us_confirmed_deaths['State']+ '<br>' + 'Confirmed: '+ \
us_confirmed_deaths['Confirmed'].astype('str') + '<br>' +'Deaths: ' + us_confirmed_deaths['Deaths'].astype('str')
us_confirmed_deaths.head()
# plot the map in each state
warnings.filterwarnings("ignore")
colors = ["#F9F9F5", "#FAFAE6", "#FCFCCB", "#FCFCAE", "#FCF1AE", "#FCEA7D", "#FCD97D",
"#FCCE7D", "#FCC07D", "#FEB562", "#F9A648", "#F98E48", "#FD8739", "#FE7519",
"#FE5E19", "#FA520A", "#FA2B0A", "#9B1803", "#861604", "#651104", "#570303",]
fig = go.Figure(data=go.Choropleth(
locationmode = "USA-states",
locations = us_confirmed_deaths['State'],
z = us_confirmed_deaths['Confirmed'],
text = us_confirmed_deaths['Text'],
colorscale = colors,
autocolorscale=False,
reversescale=False,
colorbar_title = 'Reported Cases',
))
fig.update_layout(
title_text='Spread Of The COVID-19 Confirmed Cases In The U.S. States',
geo = dict(
scope='usa',
projection=go.layout.geo.Projection(type = 'albers usa'),
showlakes=True, # lakes
lakecolor='rgb(255, 255, 255)'),
)
# drop the rows whose FIPS are 0
us_county = us[us.countyFIPS != 0].copy()  # copy to avoid SettingWithCopyWarning below
us_county.head()
# plot the map of each county in the US
us_county['countyFIPS'] = us_county['countyFIPS'].apply(lambda x: str(x).zfill(5))
colorscale = ["#f7fbff", "#ebf3fb", "#deebf7", "#d2e3f3", "#c6dbef", "#b3d2e9", "#9ecae1",
"#85bcdb", "#6baed6", "#57a0ce", "#4292c6", "#3082be", "#2171b5", "#1361a9",
"#08519c", "#0b4083", "#08306b"
]
endpts = list(np.linspace(1, 100, len(colorscale) - 1))
fips = us_county['countyFIPS'].tolist()
values = us_county['Confirmed'].tolist()
fig = ff.create_choropleth(
fips=fips, values=values, scope=['usa'],
binning_endpoints=endpts, colorscale=colorscale,
show_state_data=False,
show_hover=True,
centroid_marker={'opacity': 0},
asp=2.9,
title='Spread Of The COVID-19 Confirmed Cases In The U.S. Counties',
legend_title='Reported Cases'
)
fig.layout.template = None
fig.show()
# Total confirmed cases in top 15 states
plt.figure(figsize=(15,5))
sns.set(font_scale=1.2)
ax = sns.barplot(x=us_confirmed_deaths.sort_values('Confirmed', ascending=False).Confirmed[:15],y=us_confirmed_deaths.sort_values('Confirmed', ascending=False).State[:15])
plt.xlabel('Total Confirmed')
plt.title('Total Confirmed Cases In Top Fifteen States In US', pad=20, fontsize=20)
# set individual bar labels
for i in ax.patches:
# get_width pulls left or right; get_y pushes up or down
ax.text(i.get_width()+0.001, i.get_y()+0.5, int(i.get_width()), fontsize=12);
# Total deaths in top 15 states
plt.figure(figsize=(15,5))
ax = sns.barplot(x=us_confirmed_deaths.sort_values('Deaths', ascending=False).Deaths[:15],y=us_confirmed_deaths.sort_values('Deaths', ascending=False).State[:15])
plt.xlabel('Total Deaths')
plt.title('Total Deaths In Top Fifteen States In US', pad=20, fontsize=20)
# set individual bar labels
for i in ax.patches:
# get_width pulls left or right; get_y pushes up or down
ax.text(i.get_width()+0.001, i.get_y()+0.5, int(i.get_width()), fontsize=12);
SIR disease model
# Early in an outbreak the SIR model grows exponentially: I(t) ≈ I(0) * e^((beta - gamma) t).
# Fit beta from early cumulative counts, with gamma fixed at 1/14 (an assumed 14-day recovery period).
def func_I(t, beta):
    return math.e ** ((beta - 1 / 14) * t)
# day indices and cumulative confirmed cases
xdata = [0, 57, 58, 59, 60, 61]
ydata = [3, 239383, 243365, 245847, 250062, 254462]
popt, pcov = curve_fit(func_I, xdata, ydata)
beta = popt[0]
gamma = 1 / 14
print('beta:', beta, 'gamma:', gamma)
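The fitted parameters imply a basic reproduction number R0 = beta / gamma. A minimal sketch, using beta = 0.279 (the ~27.9% infection rate this analysis reports) as a stand-in for the exact `curve_fit` output:

```python
# Basic reproduction number implied by the fitted SIR parameters.
# beta = 0.279 is the approximate infection rate quoted in this analysis,
# not the exact fitted value printed above.
beta = 0.279
gamma = 1 / 14  # assumed 14-day recovery period
R0 = beta / gamma
print('R0 ~', round(R0, 2))  # → R0 ~ 3.91
```

An R0 near 4 means each infected person passes the virus to roughly four others in a fully susceptible population, which is why the epidemic curve below grows so steeply at first.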
# Total population, N.
N = 19450000
# Initial number of infected and recovered individuals, I0 and R0.
I0, R0 = 3, 0
# Everyone else, S0, is susceptible to infection initially.
S0 = N - I0 - R0
# A grid of time points (in days)
t = np.linspace(0, 180, 180)
# The SIR model differential equations.
def deriv(y, t, N, beta, gamma):
S, I, R = y
dSdt = -beta * S * I / N
dIdt = beta * S * I / N - gamma * I
dRdt = gamma * I
return dSdt, dIdt, dRdt
# Initial conditions vector
y0 = S0, I0, R0
# Integrate the SIR equations over the time grid, t.
ret = scipy.integrate.odeint(deriv, y0, t, args=(N, beta, gamma))
S, I, R = ret.T
# Plot the data on three separate curves for S(t), I(t) and R(t)
fig = plt.figure(figsize=(15,5))
ax = fig.add_subplot(111, axisbelow=True)
ax.plot(t, S/1000000, 'b', alpha=0.5, lw=2, label='Susceptible')
ax.plot(t, I/1000000, 'r', alpha=0.5, lw=2, label='Infected')
ax.plot(t, R/1000000, 'y', alpha=0.5, lw=2, label='Recovered with immunity')
ax.set_xlabel('Time /days')
ax.set_ylabel('Number (million)')
ax.yaxis.set_tick_params(length=0)
ax.xaxis.set_tick_params(length=0)
ax.grid(b=True, which='major', c='w', lw=2, ls='-')
legend = ax.legend()
legend.get_frame().set_alpha(0.5)
for spine in ('top', 'right', 'bottom', 'left'):
ax.spines[spine].set_visible(False)
Based on a 27.9% infection rate and a 7.1% recovery rate, the number of infected people will peak at about 7.5 million in New York State.
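The quoted peak can be read directly off the integrated I(t) curve. A self-contained sketch with the same parameter values used above (N = 19,450,000; beta = 0.279 as an approximation of the fitted value; gamma = 1/14):

```python
import numpy as np
from scipy.integrate import odeint

N, I0, R0_init = 19_450_000, 3, 0  # population, initial infected, initial recovered
beta, gamma = 0.279, 1 / 14        # approximate fitted transmission and recovery rates

# The SIR model differential equations, as defined above.
def deriv(y, t, N, beta, gamma):
    S, I, R = y
    dSdt = -beta * S * I / N
    dIdt = beta * S * I / N - gamma * I
    dRdt = gamma * I
    return dSdt, dIdt, dRdt

t = np.linspace(0, 180, 180)
S, I, R = odeint(deriv, (N - I0 - R0_init, I0, R0_init), t, args=(N, beta, gamma)).T
# Locate the epidemic peak on the integrated infected curve.
print('peak day:', int(t[I.argmax()]), '| peak infected (millions):', round(I.max() / 1e6, 1))
```

With these parameters the infected curve tops out in the vicinity of 7.5 million, consistent with the figure above.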
covid_19_data_F = pd.read_csv('covid_19_clean_complete.csv')
time_series_covid_19_confirmed_global_F = pd.read_csv('time_series_covid_19_confirmed.csv')
time_series_covid_19_confirmed_US_F=pd.read_csv('time_series_covid_19_confirmed_US.csv')
# Visualizing the number of confirmed cases in the US by day
list2_F=time_series_covid_19_confirmed_US_F.iloc[:,11:].sum(axis=0)
time_series_covid_19_confirmed_US_F = time_series_covid_19_confirmed_US_F.append(list2_F, ignore_index=True)
time_series_covid_19_confirmed_US_T=time_series_covid_19_confirmed_US_F.T
time_series_covid_19_confirmed_US_T.iloc[11:,3253].dropna().plot()
plt.ylabel('Number of confirmed cases by day')
plt.xlabel('Days from 1/22/20')
plt.title('Number of Confirmed cases in the US');
# Create a new dataframe containing only the total number of confirmed cases in the US by day, for the predictive analysis below.
list2_US=time_series_covid_19_confirmed_US_T[3253].dropna()
Number_of_cases_US=list2_US.to_frame().reset_index()
Number_of_cases_US=Number_of_cases_US.drop('index',axis=1)
Number_of_cases_US=Number_of_cases_US.rename(columns={3253: "number_of_confirmed_cases_US"})
len(Number_of_cases_US)
For all predictive models:
Convert the standardized dates to integer day indices starting from 1, for easier visualization and modeling; our data run from 01/22/20 to 04/06/20.
Choose 10 days for the future prediction.
days_in_future = 10
future_forcast = np.array([i for i in range(len(Number_of_cases_US)+days_in_future)]).reshape(-1, 1)
adjusted_dates = future_forcast[:-10]
start = '1/22/2020'
start_date = dt.datetime.strptime(start, '%m/%d/%Y')
future_forcast_dates = []
for i in range(len(future_forcast)):
future_forcast_dates.append((start_date + dt.timedelta(days=i)).strftime('%m/%d/%Y'))
# Split arrays into training data and test data which will be randomly selected
X_F1 =list(range(1, 77))
X_F1 = np.array(X_F1).reshape(-1, 1)
y_U = np.array(Number_of_cases_US['number_of_confirmed_cases_US'])
X_train_confirmed_U, X_test_confirmed_U, y_train_confirmed_U, y_test_confirmed_U = train_test_split(X_F1, y_U, test_size=0.25, shuffle=False)
poly_U = PolynomialFeatures(degree=6)
poly_X_train_confirmed_U = poly_U.fit_transform(X_train_confirmed_U)
poly_X_test_confirmed_U = poly_U.transform(X_test_confirmed_U)
poly_future_forcast_U = poly_U.transform(future_forcast)
linear_model_U = LinearRegression(normalize=True, fit_intercept=False)
linear_model_U.fit(poly_X_train_confirmed_U, y_train_confirmed_U)
test_linear_pred_U = linear_model_U.predict(poly_X_test_confirmed_U)
linear_pred_U = linear_model_U.predict(poly_future_forcast_U)
print('MAE:', mean_absolute_error(test_linear_pred_U, y_test_confirmed_U))
print('MSE:',mean_squared_error(test_linear_pred_U, y_test_confirmed_U))
In the SVM model, a number of parameters need to be tuned: the kernel ('linear', 'poly', 'sigmoid', and so on), the polynomial degree, gamma, and shrinking. All of these must be adjusted to achieve the optimal fit.
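One systematic way to search these parameters is a cross-validated grid search; a minimal sketch on synthetic data (the grid values are illustrative, not the ones used in this project):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Synthetic day-index vs. cumulative-count data standing in for the real series.
X = np.arange(1, 41).reshape(-1, 1)
y = X.ravel().astype(float) ** 2

# Illustrative grid over the SVR parameters mentioned above.
param_grid = {'kernel': ['poly', 'rbf', 'sigmoid'],
              'C': [0.1, 1, 10],
              'gamma': [0.01, 0.1],
              'shrinking': [True, False]}
search = GridSearchCV(SVR(), param_grid, cv=3, scoring='neg_mean_absolute_error')
search.fit(X, y)
print(search.best_params_)
```

Note that for a time series, a time-ordered split (as used with `shuffle=False` above) is more defensible than random cross-validation folds; the sketch only illustrates the mechanics of the search.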
svm_confirmed_U = SVR(shrinking=True, kernel='poly',gamma=0.01, epsilon=1,degree=6, C=0.1)
svm_confirmed_U.fit(X_train_confirmed_U, y_train_confirmed_U)
svm_pred_U = svm_confirmed_U.predict(future_forcast)
svm_test_pred_U = svm_confirmed_U.predict(X_test_confirmed_U)
print('MAE:', mean_absolute_error(svm_test_pred_U, y_test_confirmed_U))
print('MSE:',mean_squared_error(svm_test_pred_U, y_test_confirmed_U))
# plot the graph of SVM model and Linear Regression
plot1=plt.plot(X_F1,y_U,label="Number of confirmed cases actual in US", color='purple')
plot2=plt.plot(future_forcast,svm_pred_U,label='SVM Model in US', color ='pink',linestyle='dashed')
plot3=plt.plot(future_forcast,linear_pred_U, label='Linear Regression Model in US',color ='green',linestyle='dashed')
plt.xlabel('Days')
plt.ylabel('Number of confirmed cases in US')
plt.title('Actual vs. Linear Regression vs. SVM prediction of confirmed cases in US')
plt.legend();
To achieve a better fit, we went beyond general-purpose models and used a model specifically developed for the study of infectious diseases: the logistic growth model, P(t) = K·P0·e^(r(t−t0)) / (K + P0·(e^(r(t−t0)) − 1)), fitted to the existing data. Here P(t) is the number of confirmed cases at time t, K is the maximum number of cases, and r is the growth rate, which is damped by resistance to the outbreak's growth (e.g. medical isolation). As the visualizations show, this curve fits better than the other basic models.
def logistic_increase_function(t, K, P0, r):
    # r is fixed at 0.29 here, so curve_fit effectively fits only K and P0
    r = 0.29
    t0 = 1
    exp_value = np.exp(r * (t - t0))
    return (K * exp_value * P0) / (K + (exp_value - 1) * P0)
t_F1 = list(range(1, 77))
t_F1 = np.array(t_F1)
P_US = Number_of_cases_US['number_of_confirmed_cases_US']
P_US = np.array(P_US)
popt, pocv = curve_fit(logistic_increase_function, t_F1, P_US)
P_US_predict=logistic_increase_function(t_F1,popt[0],popt[1],popt[2])
print('MAE:', mean_absolute_error(P_US, P_US_predict))
print('MSE:',mean_squared_error(P_US, P_US_predict))
# Visualizations of the actual trend line and the predicted line under the logistic growth model
plot1=plt.plot(t_F1,P_US,label="Number of confirmed cases actual")
plot2=plt.plot(t_F1,P_US_predict,label='logistic Regression Model')
plt.xlabel('Days')
plt.ylabel('Number of confirmed cases')
plt.legend()
plt.title('Actual vs. Logistic prediction of confirmed cases in US');
By comparing the mean absolute error and mean squared error of the different models, we find that the logistic model is the most consistent with the actual confirmed-case curve. Again, the main reason is that the logistic model is closely related to formulations used for infectious diseases. The linear model is the most common and familiar model, and its fit is acceptable; we also applied PolynomialFeatures, which leaves one more parameter (the degree) to be tuned.
As for the SVM model, many parameters need to be adjusted, such as the kernel, gamma, and shrinking. With limited data and limited time, the optimal parameters may not have been found, but we believe SVM can achieve better results.
MAE = {
'US':[mean_absolute_error(test_linear_pred_U, y_test_confirmed_U),mean_absolute_error(svm_test_pred_U, y_test_confirmed_U),mean_absolute_error(P_US, P_US_predict)]
}
MAE_table = pd.DataFrame(MAE)
MAE_table.index = ['Linear Model', 'SVM Model', 'Logistic Model']
MAE_table
MSE = {
'US':[mean_squared_error(test_linear_pred_U, y_test_confirmed_U),mean_squared_error(svm_test_pred_U, y_test_confirmed_U),mean_squared_error(P_US, P_US_predict)]
}
MSE_table = pd.DataFrame(MSE)
MSE_table.index = ['Linear Model', 'SVM Model', 'Logistic Model']
MSE_table
The MAE and MSE tables above confirm this ranking: the logistic model performs best, followed by the linear model, and then the SVM model, whose many parameters (kernel, gamma, shrinking, and so on) could not be fully tuned with the limited data and time available.
#read country code
country_code=pd.DataFrame(pd.read_csv("wikipedia-iso-country-codes.csv").rename(columns={'English short name lower case':'Country/Territory',"Alpha-3 code":"Code"})[["Country/Territory","Code"]])
covid_data=pd.read_csv("covid_19_clean_complete.csv")
covid=pd.DataFrame(covid_data[covid_data['Date']=='4/23/20'])
covid19 = covid.groupby(by='Country/Region').agg({'Confirmed': 'sum','Deaths': 'sum','Recovered':'sum'}).sort_values(by="Confirmed",ascending=False).reset_index().rename(columns={'Country/Region':'Country/Territory'})
covid19.loc[0,"Country/Territory"]="United States"
#read confirmed numbers with country and code
confirm_num=pd.merge(covid19,country_code,on='Country/Territory',how='left').reset_index(drop=True).sort_values(by="Confirmed",ascending=False)
As we know, older people are at higher risk of COVID-19 infection, so we read data on the percentage of each country's population aged over 65.
aging=pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_age_structure',header=1)[0][["Country",'age over 65 years[3]']].rename(columns={"Country":"Country/Territory","age over 65 years[3]":"Over_65"})
aging=pd.merge(aging,country_code,on='Country/Territory',how='left').dropna()
aging.head()
For the current 2020 fiscal year, income groups are defined by GNI per capita in 2018, calculated using the World Bank Atlas method: low-income economies are those with $1,025 or less; lower-middle-income economies are those between $1,026 and $3,995; upper-middle-income economies are those between $3,996 and $12,375; and high-income economies are those with $12,376 or more.
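These thresholds are easy to encode; a small helper (hypothetical, for illustration only — the actual group labels come from the Income Class.csv file loaded below):

```python
def income_class(gni_per_capita):
    """Classify a 2018 GNI per capita (World Bank Atlas method, USD)
    into the FY2020 World Bank income groups quoted above."""
    if gni_per_capita <= 1025:
        return 'Low income'
    elif gni_per_capita <= 3995:
        return 'Lower middle income'
    elif gni_per_capita <= 12375:
        return 'Upper middle income'
    return 'High income'

print(income_class(3000))   # → Lower middle income
print(income_class(50000))  # → High income
```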
Country_Class=pd.DataFrame(pd.read_csv("Income Class.csv",index_col="CountryName").dropna().rename_axis("Country/Territory").rename(columns={"CountryCode":'Code'})).drop(columns="GroupName")
Country_Class.head()
The Worldwide Governance Indicators (WGI) provide an estimate of governance performance, ranging from approximately -2.5 (weak) to 2.5 (strong).
WGI=pd.DataFrame(pd.read_csv("WGI.csv",index_col="Country/Territory")).rename({"Estimate":"WGI"},axis="columns")
WGI.head()
This is a list of countries and dependent territories ranked by population density, measured by the number of human inhabitants per square kilometer, and also sortable by total area and by population.
pop_dens=pd.read_csv("API_EN.POP.DNST_DS2_en_csv_v2_988966.csv",index_col='Country Name')[['Country Code','2018']].rename_axis("Country/Territory").rename(columns={'Country Code':"Code","2018":"Pop_dens"})
pop_dens.head()
Average years of tertiary schooling attained (Barro–Lee dataset).
barro_lee=pd.read_csv("Education lelvel.csv",index_col='country')[["WBcode","yr_sch_ter"]].rename_axis('Country/Territory').rename(columns = {'WBcode':'Code'})
barro_lee.head()
Health spending measures the final consumption of health care goods and services (i.e. current health expenditure), including personal health care (curative care, rehabilitative care, long-term care, ancillary services, and medical goods) and collective services (prevention and public health services, as well as health administration).
health_exp=pd.read_csv("2.12_Health_systems.csv")[["World_Bank_Name",'Health_exp_per_capita_USD_2016']].rename(columns={"World_Bank_Name":'Country/Territory','Health_exp_per_capita_USD_2016':'Health_exp_per_capita_USD'})
health_exp=pd.merge(country_code,health_exp,on='Country/Territory',how='left')
health_exp.sort_values(by='Code').head()
# Merge Multiple df: https://stackoverflow.com/questions/44327999/python-pandas-merge-multiple-dataframes/44338256
data_frames = [confirm_num, aging, Country_Class,WGI,pop_dens,barro_lee,health_exp]
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['Code'],
how='left'), data_frames)
df_merge=df_merged.drop(columns=["Country/Territory_x","Country/Territory_y"]).set_index("Country/Territory")
data=df_merge.sort_values(by="Confirmed",ascending=False).dropna()
data.head()
data=data.reset_index().drop(columns=["Country/Territory","Deaths","Recovered","Code"])
data["Over_65"] = data["Over_65"].apply(lambda x: np.nan if x == '-' else x[:-1]).astype(float) / 100  # strip the trailing '%' and convert to a fraction
cat_col = data.select_dtypes(exclude=np.number)
cat_col_encoded = pd.get_dummies(cat_col,drop_first=True)
num_cols = data.select_dtypes(include = np.number)
df = pd.concat([num_cols,cat_col_encoded],sort=False,axis=1)
z=df.drop(columns="Confirmed",axis=1)
X=z.values
y = df['Confirmed'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .1)
lr = LinearRegression()
lr.fit(X_train, y_train)
for i, name in enumerate(z.columns): #enumerate adds a number to an iteration
print(f'{name:>2}: {lr.coef_[i]}')
print()
# The intercept is the estimator's intercept_ attribute (b in the linear equation)
print(lr.intercept_)
country_group=data.groupby('GroupCode').sum()['Confirmed']
group=country_group.plot(kind='bar',rot=0,figsize=(10,10))
group.set_xlabel('Country Income Group')
group.set_ylabel('Confirmed')
group.set_title('Confirmed by Income Group', fontdict={'fontsize': 20, 'fontweight': 700, 'color': 'maroon'}, pad=20)
plt.xticks(rotation='horizontal')
for p in group.patches:
height = p.get_height()
group.annotate('{}'.format(height),xy=(p.get_x() + p.get_width() / 2, height), ha='center', va='bottom')
As we can tell from this plot, confirmed cases in the high-income class are much higher than in the other two classes; the low-income class (LIC) has fewer than 4,000 confirmed cases in total. COVID-19 is a global outbreak that started in China, and in the beginning, countries with more multinational enterprises requiring a lot of international travel faced higher risks than countries whose economies are less active.
The linear regression model is not a good one for predicting each country's confirmed count, but some of its inferences are meaningful for understanding how COVID-19 affects different countries.
This dataset contains the death tolls of every new pandemic/epidemic in the 21st century, recording the number of deaths from each disease over a span of 800 days (124 days for COVID-19). By examining this dataset, we can analyze the growth modes of these epidemics and gain insight into how deadly COVID-19 may end up being.
df = pd.read_csv('Disease Data.csv')
df.head()
lines1 = df.plot.line(x='NAME', y=df.columns[1:8], logy = False)
Because the death-toll scales differ, only Swine Flu is (for now) on a similar scale to COVID-19. We therefore apply a log scale to the dependent variable to see the comparison in more detail. Even without it, we can clearly see how extremely deadly COVID-19 is compared to the other epidemics.
lines = df.plot.line(x='NAME', y=df.columns[1:8], logy = True)
In this log-scale plot, we can clearly see that COVID-19 was not the fastest killer: it did not surpass all the other epidemics until around day 50. This finding agrees with our knowledge that COVID-19 has a relatively long incubation period compared to other viruses. We can also see a slowdown in its death toll around day 50, which matches COVID-19's successful containment in China in late February, before it had become a major problem globally.
After conducting all the analysis and answering the questions we posed, we can now summarize our findings and validate them. We found that different countries have their own growth patterns, driven by the strictness of their regulations and quarantine policies. Comparing the models' performance, the logistic model generally fits the data most accurately, out-performing the SVM and linear models in predicting total confirmed-case trends; this is validated by China's case pattern, as the pandemic is nearing its end there. Using the SIR model, we also estimate that the real number of infections is at least twice the reported number, which is consistent with Governor Cuomo's assertion that New York's actual case count is much higher than reported: his team randomly tested 30,000 people for antibodies in New York supermarkets and found a 13.4% antibody-positive rate. If we extrapolate this number to other areas, we can conclude that the real case count is at least two to three times higher than reported.
It is almost certain that COVID-19 is the most severe pandemic humanity has faced in at least a century. Its high transmissibility and relatively high lethality are putting global healthcare services under enormous pressure. Under the current circumstances, every one of us should keep practicing social distancing and support the healthcare system. Fortunately, we are seeing another slowdown in the log-scale graph above, suggesting that the number of deaths has stopped accelerating, though it remains stable at a frighteningly high level. We hope our small study can contribute to the global effort in fighting COVID-19.